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Abstract 

Every permutation of {1, 2, . . . , N} can be written as the product of two involutions. 
As a consequence, any permutation of the elements of an array can be performed in-place 
using simultaneous swaps in two rounds of swaps. In the case where the permutation 
is the /c-way perfect shuffle we develop two methods for efficiently computing the pair 
^ \ of involutions that accomplishes these swaps. 

The first method works whenever N is a power of k; in this case the time is O(N) 
and space 0(log 2 iV). The second method applies to the general case where N is a 
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multiple of k; here the time is 0(N log AT) and the space is 0(log N). If k = 2 the 
qq ] space usage of the first method can be reduced to 0(log N) on a machine that has a 

If} ■ SADD (population count) instruction. 
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CN ■ 1 Introduction 



Certain useful permutations on strings, represented by linear arrays of iV elements, can 
be realised, without difficulty, in O(N) time and using only O(logiV) extra space to store 
a constant number of variables. Examples are reversals and cyclic shifts. However, the 
permutation known as the perfect shuffle is not so easily achieved within the given time and 
space constraints. 

The perfect shuffle is defined on strings of even length, say N. Assuming the string 
elements occupy positions through N — 1, the element occupying position i is translated 
to position 2i mod N — 1. The effect is illustrated in Figure [TJ This version of the perfect 
shuffle is called an in- shuffle; the end elements are not moved. The version in which they 
are moved is called an out-shuffle. It is only necessary to analyse one form and in this paper 
we only consider the in-shuffle. In the generalised, fc-way, shuffle, the element occupying 
position i is translated to position ki mod N — 1, where = kM for some integer M. 
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Figure 1: A Perfect In-shuffle on 12 elements. 
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The perfect shuffle arises in several computer science applications. For example, certain 
parallel processing algorithms are conveniently executed on a perfect-shuffie-based archi- 
tecture [5] and there is an in-place merging algorithm [3J Q] which reduces the list merging 
problem to the problem of realising the perfect shuffle (see also [7])- 

Shuffle algorithms that conform to the O(N) time and 0(log N) space constraints have 
been described in [3l Section 7] and in [6]. These solutions use a "cycle leader" algorithm 
which requires the generation of "leader" for each of the cycles that make up the permuta- 
tion. An analysis of the cycles comprising the perfect shuffle is given in [2] and in [5] the 
/c-way perfect shuffle on a deck of size N = kM is discussed. 

Our methods are not cycle leader algorithms; rather they are based on identifying pairs 
of independent elements that swap positions. These swaps occur in two rounds, and thus 
each element participates in no more that two swaps. The resulting configurations can be 
represented by diagrams that are reminiscent of comparator networks used for sorting, but 
now the comparators always swap their elements. One might imagine that this would be 
a good way to permute data in hardware since the configurations can be laid out on two 
layers, with the "wires" on one layer going horizontally, and going vertically in the other 
layer. See Figure [2j 

We first present a general strategy for permuting in-place in two rounds of swaps (Section 
2) and then present two instantiations of that strategy (Sections 3 and 4) as applied to the 
perfect shuffle permutation. The two methods are sufficiently simple and efficient that they 
may provide a viable alternative to known methods. The first method works whenever N is 
a power of k; in this case the time is 0(N) and space 0(log 2 N). If k = 2 we also show how 
to adapt it, still in O(N) time, for general N (but not in 2 rounds of swaps). The second 
method applies to the general case where N is only assumed to be a multiple of k; here the 
time is 0(N log N) and the space is 0(log 2 N). The space usage of the first method can be 
reduced to O(logiV) on a machine that has a SADD (population count) instruction. The 
time complexity measures just stated assume that swap, assignment, relational operations 
and elementary arithmetic operations all take 0(1) time. For space complexity we count 
the number of bits required to represent a variable. Furthermore, all the swaps in each 
phase are independent and so, in parallel, the perfect shuffle can be achieved in O(l) time. 

2 General permutations 

An involution is a permutation that is its own inverse. It is well-known that every per- 
mutation can be written as the product of disjoint cycles. Every involution is the disjoint 
product of 1-cycles and 2-cycles. If the permutation consists of one cycle, then we call it a 
circular permutation. 

We will prove in this section that every permutation is the product of two involutions. 
Thus every permutation can be performed in two rounds of (simultaneous) swaps, or se- 
quentially by having each element participate in at most two swaps. Permuting in place 
has been considered previously [5]. It has been noted before that every permutation is the 
product of two sets of transpositions, i.e. cycles of length 2. See, for example, exercise 
10.1.17 in although we do not know the originator of this result. In the theorem below 
we also count the number of factorizations of a circular permutation as a product of invo- 
lutions. This result was also found recently and independently in [10], see their Corollary 
2.5. 

Theorem 2.1. Every circular permutation of n elements can be written as the product of 
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two involutions. Furthermore, there are exactly n such factorizations if n > 2. 

Proof. Since we can relabel the elements of a cycle, it is sufficient to prove the result 
for C n , where C n is the cyclic shift of 0, 1, ... ,n — 1 written in cycle notation as C n = 
(0 1 • • • n—2 n— 1). Define 1^ to be the involution 

I k := (0 fc)(l fe — 1) • • • ([k/2\ \k/2~\) (jfe+1 n-l)(fc+2 n-2) ■ ■ ■ ([(k + n)/2j f(jfe + n)/2l) 

It is easy to check that for, k = 0, 1, . . . , n — 1, the the following equality holds (where A: — 1 
is interpreted as n — 1 when A; = 0). 

C n = hh-i- (1) 

Furthermore, we show below that there are no other factorizations of C n into two involutions. 

Suppose that C n = ST where S and T are involutions, which we will think of as products 
of disjoint 2-cycles, where a 2-cycle can be degenerate in the sense of being of the form (x x). 
The involution S must have a 2-cycle (0 k) for some k. But then T must have the 2-cycle 
(0 k—1) in order that ST(k—l) = S(0) = k. And then S must have the 2-cycle (1 k— 1) in 
order that ST(0) = S(k—1) = 1. Continuing in this manner, alternately inferring 2-cycles 
in S and T we conclude that S contains (0 k)(l k — 1) • • • ([k/2\ \k/2~\) and that T contains 
(0fc-l)(lfc-2)...(L(fc-l)/2j r(fc-l)/2l). 

We now claim that T must have the 2-cycle (k n—1) in order that ST(n—l) = S(k) = 0. 
But then S must have the 2-cycle (k+1 n—1) in order that ST(k) = S(n—1) = k+1. 
Continuing in this manner, alternately inferring 2-cycles in T and S we conclude eventually 
that T contains (k n-l)(k+l n-2) •••([(& + n - l)/2j \(k + n — l)/2]) and S contains 
(jfe+1 n-l)(fc+2 n-2)---(L(fc + n)/2j \{k + n)/2~\). 

Thus we have shown that S = It and T = Ik-i- D 

Corollary 2.2. Every permutation of n elements can be written as the product of two 
involutions. 

Proof. This follows from Theorem 12.11 since it is well-known that every permutation can 
be written as the product of disjoint cycles, and the product of disjoint involutions is 
again an involution. In more detail, suppose that ir = C1C2 • • • is the disjoint cycle 
decomposition of a permutation ir. By the theorem, ir = (SiTi)(<S2?2) • • • (SkTf.) where 
Si and Tj are involutions. Since they are disjoint, TiSj = SjTi whenever % 7^ j. Thus 
7r = (S1S2 ' • ' Sk)(TiT2 ■ ■ ■ Xfc), the product of two involutions. □ 

In order to apply this theorem and its corollary the cycles of the permutation need to 
be identified. In the next sections we will show how to compute the two involutions by two 
separate methods when the permutation is the perfect shuffle. 

3 Computing perfect shuffles 

Initially we assume that iV = k n , where we are doing a /c-way perfect shuffle. Let i = 
(&n-l • • • fri£>o)fc be an n-digit fc-ary number. Define rev p (i) to be the integer whose base-/c 
representation is the same as that of i except that the p least significant digits are reversed. 
For example, in binary, revs(44) = rev3((101100)2) = (101001)2 = 41. Since rev p (rev p (i)) = 
i, the function iev p (i) is an involution. 
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Figure 2: A network corresponding to the algorithm for (a) k = 2, n = 4 on the left, and 
(b) k = 3, n = 3 on the right. 

Now consider the map := rev n (rev n _i(i)) operating on the n-digit /c-ary number i. 
Let j = 6 n _i. Note that 

E(i) = rev n (rev n _i((j'6 n _ 2 • • • 6i6 )fc)) 
= rev n ((jb bi ■ ■ ■ b„- 2 )k) 
= (b n - 2 ■ ■ ■ hboj) k 
= ki-j(k n -l). 

Hence, 

S(t) = ki mod {k n - 1), 

and thus is the k-w&y "perfect shuffle" permutation of {0, 1, . . . , k n — 1}. The perfect 
shuffle can then be seen as two sets of swaps, where the first performs rev n _i and the second 
performs rev n . Since all of the swaps in each of the two phases are independent, in parallel, 
the perfect shuffle can be achieved in O(l) time. Example networks are shown in Figure 
[2j There the digits in the right side column represent i, and those in the left side represent 
the inverse of the function. 

As a sequential algorithm one can execute REvSwAp(n, k, n — 1); REvSwAp(n, k, n) 
where r,evSwap(ti, k, t) is given below. (The a :=: b is the notation introduced by Knuth 
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and used in Sedgewick [12] to denote the swap or exchange operation, i.e. the values of a 
and b are exchanged.) 

The pseudo-code below outlines our algorithm. 

SHUFFLE(n, k) REvSWAP(n, k,t) 

REvSWAP(n, k, n — 1) for i e {0, 1, . . . , k n - 1} do 

REVSWAP(n, k, n) j : = revt(i) 

if i < j then A[i] :=: A[j] 

To generate the (i,iev n (i)) pairs, we use the following lemma to adjust rev n (i)) appro- 
priately as i is incremented. 

Lemma 3.1. Let p be the leftmost digit that changes in the k-ary representation of i when 
i is incremented by one. Then 

rev„(i + 1) = rev n (i) - k n + k n ~ p + k n ~ p ~ 1 . 

Proof. As a fc-ary number, for some < j < k — 1, the numbers i and i + 1 have the form 

i = (■ . . j (fc-l)(fc-l) • • • (fc-l) ) fe and i+l = (...(j+l)00^_0) fc . 

Thus 

reVn(i ) _ rev n (i+l) = (jk^P- 1 + (k-l)(k n -P + ■■■ + k n ~ 1 )) - {(j + l)^" 1 ) 

h.n _ h.n-p 

= -k n ~P~ l + (k-1) 



k-1 

k n — k n ~P~ l — k n ~ p 



□ 



We can realise the computation defined by this Lemma, ignoring the computation of p 
for the moment, by storing a table of the n powers of k, using 0(log 2 N) bits, plus 0(1) 
bits for a constant number of variables, and in constant time per pair. 

The successive p values can be computed by simulating the incrementation of i on a k- 
ary counter and noting which is the leftmost bit to change. This process is constant time per 
incrementation, when amortised over all values of i. As a function of i when k = 2, the value 
p is called the "ruler function", see [8], and can be computed in various other ways. On a 
machine with a "population count" or SADD instruction (e.g., the MMIX machine of Knuth 
[8]), it can be computed in a constant number of machine operations per incrementation. 
It can also be implemented by using a cast on floating point numbers to extract [lg n \ > as 
pointed out in Knuth [8] on page 142 (see equation (55)). We implemented this approach in 
C on a machine with a 2.0 GHz Opteron processor and found that it computed the required 
swaps m under a minute when AT = 2 30 = 1,073,741,824. The program is available at 



http : //webhome . cs .uvic . ca/~ruskey/Publications/Shuf f le/Perf ectShuf fie .html 



How many swaps are done by the algorithm? If n is even then the number of swaps in 
the first phase is k(k n ~ 1 — k n ^ 2 )/2 and the number in the second phase is (k n — k n ^ 2 )/2. If 
n is odd then the number of swaps in the first phase is k(k n ~ 1 — k^ n ~ 1 ^ 2 )/2 and the number 
in the second phase is also (k n — k^ n+1 ^ 2 )/2. In the even case the total is k n - {k + l)k n / 2 /2 
and in the odd case the total is k n - k^ n+1 ^ 2 . Thus in either case the number of swaps is 

n + o(Vn). 
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3.1 Modification for the cases where N is not a power of k 

The preceding analysis solves the problem for the case where the number of elements, N, 
is a power of k. The general case, where N is not a power of k, but still we have N = kM, 
can be reduced to the special case using the technique described in [3l Section 7], at least 
when k = 2. 

The technique may be understood by means of an example. Suppose that N = 2M = 
30 = 2 • 15. Think of the two initial sequences of 15 elements as being broken down into 
subsequences of elements whose sizes are powers of 2 listed in decreasing order according 
to the binary representation of M. Proceeding from most significant bit to least significant 
bit we can do a series of rotations until the subsequences are in their proper positions as 
they would appear in the perfect shuffle. This process is illustrated below. 

Initial string (spaces are only for clarity): 
ssssssss tttt uu v ssssssss tttt uu v 

Move 8 ss: 

ssssssss ssssssss tttt uu v tttt uu v 

Move 4 ts: 

ssssssss ssssssss tttt tttt uu v uu v 

Move 2 us: 

SSSSSSSS SSSSSSSS tttt tttt UU UU V v 



We would now apply our previous perfect shuffle algorithm to the ss, the ts and the us. 
These smaller perfect shuffles could be done in parallel. 

In the general case, there is no need to do a rotation if the corresponding bit in the binary 
representation of M is 0. The elements that were moved in each of the four rotations in the 
example above are underlined. Assuming that M = (b s ■ ■ ■ 6160)2, the number of element 
changes, C, that occur in the successive moves is 

s s 

c = J2 b ^ ■ ■ ■ b ^h < • • • 6i6o) 2 < 

i=l i=l i>l 

Thus the running time is still 0(N) and the space is still 0(log 2 N). However, the totality of 
the rotations can not be done with a constant number of rounds of swaps, and the extension 
of this idea to k > 2 is non-trivial. So we seek some other method to handle the case where 
N is not a power of k. 



M 



< N. 



4 A number- theoretic approach for the fc-way perfect shuffle, 
valid for any N 

In this section we make no restriction on N, other than it be a multiple of k. We will 
show how to use some elementary number theory to obtain explicit expressions for the two 
involutions guaranteed by Corollary 12.21 

Let g = gcd(x, m). Define J r : Z m — > 7L m by 

J r (x) = g (^j mod . 

The following lemma has two important consequences: 
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If gcd (r, m) = 1, then J r is an involution. 

If m = kn — 1, then gcd(/c,m) = 1 and thus Jk(Ji(x)) = kx for all rr E Z m . That is, 
Jfc(Ji(x)) computes the k-w&y perfect shuffle. 



Lemma 4.1. If gcd(r,m) = gcd(s,m) = 1, then 



J r (J s (x)) = xr 1 mod 



mod to. 



Proof. Let g = gcd(x,m). First note that, since gcd ( ^, ( |J mod y ) = 1, we also have 



gcd (j, (§) 1 mod = 1. Thus 



?;cd(m, J s (x)) = gcd ^rn,g ^— ^ mod — J 

(777 ( ( x\ ^ hi 
— , \s I- J mod — 

Cfjl ( f x\ ^ 777 
— > s l( _ ) mod — 

(772 / x\ ^ 777 i 

— , - mod — 
9 \9j 9 J 

. vn ( x\ m 
g ■ gcd — , — mod — 
9 \9J 9 

Tfl x\ 

9 ■ gcd — , - = g. 
9 9) 



We can now compute the composition: 
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□ 

The pseudo-code below outlines our algorithm and examples of its outputs are illus- 
trated in Figure El 
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Figure 3: A network corresponding to the number theoretic algorithm for (a) k = 2, N = 16 
on the left, and (b) k = 3, N = 27 on the right. 

ShuffleA(M, k) modInv(A/, r) 

modInv(M, 1) for i e {1, 2, . . . , k - 2} do 

MODlNV(M, k) g := gcd(x, kM - 1) 

j :={r(x/gy l ) mod {{kM - l)/g)) 
if i < j then A[i] :=: A[j] 

The involutions J\ and take longer to compute than the corresponding bit-reversal 
involutions discussed in the previous section. In particular, a modular inverse can take time 
0(log N) time to compute, via the extended Euclidean algorithm, giving a total worst case 
running time of 0{N log N). 

However, this will be overly pessimistic in practice since not all gcd calculations will be 
worst case. If we count the number of arithmetic operations that are actually used by the 
algorithm, then the true running time appears experimentally to be 0{N loglogN) with 
the worst case examples occurring when ./V is the product of a small number of distinct odd 
primes. 
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5 Final Remarks and Acknowledgement 



Here we mention some open problems. It would be of interest to find other natural classes 
of permutations for which the involutions of Corollary 12 . 21 can be efficiently computed. Note 
that the number of swaps used by the first algorithm when N = 27 and k = 3 is 18, but 
the number used by the second algorithm is 20 (See (b) in Figures [5] and [3J) . For a general 
./V and k, what is the least number required? Finally, what is the expected running time of 
ShuffleA? 

We thank Dominique Gouyou-Beauchamps for pointing out an erroneous sentence in an 
earlier version of this paper. We also thank the referees for their helpful comments. 
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